COMMUNICATIONS INTERCONNECTION NETWORK WITH DISTRIBUTED RESEQUENCING

STATEMENT AS TO RIGHTS TO INVENTIONS MADE UNDER UNIVERSITY SPONSORED RESEARCH AND DEVELOPMENT

The invention was performed in the course of work under sponsorship of Washington University of St. Louis, Mo., and Growth Networks Inc. of Mountain View, Calif. Each entity claims joint ownership in the invention. Growth Networks Inc. has exclusive rights under license from Washington University.

FIELD OF THE INVENTION

This invention relates to communications information routing; and more specifically to switch interconnection networks or switch fabrics.

BACKGROUND OF THE INVENTION

Buffered multistage interconnection networks are often used in Asynchronous Transfer Mode ("ATM") and other types of packet switching systems. Networks of this type use buffers to store packets at intermediate points when contention for output links prevents immediate transmission. As used herein, the term packet is used to indicate generically addressable data packets of all types, including fixed length cells and variable length packets.

Many multistage interconnection networks provide multiple paths between network inputs and outputs, allowing the traffic to be balanced across the alternative paths. An example of such a network is shown in FIG. 1A, which is a depiction of an architecture known as a three stage Benes network. The Benes network 10 is composed of three stages of switch elements (SE) 12, 14, 16 and two webs of interconnecting links (IL) 18, 20. Switch elements 12, 14, 16 may have any number of input type interconnecting links 24 and output type interconnecting links 26. Letting d denote the number of input links and the number of output links in a single switch element and letting n denote the number of input links and output links of the multistage network as a whole, FIG. 1A illustrates d=4 and n=16. Traffic distribution in a multistage network 10 is commonly done in one of two ways for a given end-to-end session of sending packets from an identified input to an identified output of the network. In systems that use static routing, all packets associated with a given session follow the same path through the interconnection network. (In ATM networks, a session is typically associated with a virtual circuit.) This path is selected when the session begins and typically remains fixed until the session is completed, thus guaranteeing that packets belonging to the same session are forwarded in the order they arrived.

In systems that use dynamic routing, traffic is distributed on a packet-by-packet basis so as to equalize the traffic load across the entire interconnection network. Dynamic routing systems can distribute traffic more evenly than systems that use static routing. Consequently, dynamic routing systems can be operated with lower speed internal links than are needed for systems that use static routing. However, because dynamic routing systems do not constrain the packets belonging to a single user session to a single path, they may allow packets in a given session to get out of order, that is, lose sequence.

Systems using dynamic routing typically provide resequencing mechanisms to restore the correct packet order at the outputs of the switching system. Conventional mechanisms for resequencing usually ensure that packets are delivered in the correct order, but under unusual conditions, they can fail to reorder packets correctly.

A known method for resequencing ATM packets in a multistage interconnection network uses timestamps and time-ordered output queues. Such systems require a central timing reference. The central timing reference signal is distributed to the circuits at all inputs to the network and to [...] timestamp field in the packet. When the packet emerges from the network at the appropriate destination, the timestamp is used to insert the packet into a time-ordered queue. In other words, the packets are read from the queue in increasing order of their timestamp values. Associated with the queue is an age threshold or timeout which specifies the minimum time that must elapse between the time a packet entered the interconnection network and the time it is allowed to leave the resequencing buffer at the output. If the difference between the current time and the timestamp of the first packet in the buffer is smaller than the age threshold, then the first packet is held back (along with all others "behind" it in the buffer). This allows packets that are delayed for a long period of time in the interconnection network to catch up with other packets that experience smaller delays.

If the age threshold is larger than the maximum delay that packets ever experience in the interconnection network, then the time-based resequencing method will always deliver packets in order. However, if packets are delayed by more than the time specified by the age threshold or timeout, errors may occur. In typical systems, delays in the interconnection network are usually fairly small (a few packet times per stage) and packets only rarely experience long delays. On the other hand, a worst-case delay may be very large. As a result, the age threshold is usually set not to the worst-case delay, but to a smaller value chosen to give an acceptably small frequency of resequencing errors. The resequencing delay can be traded off against reduced probability of resequencing errors. Conventional time-based resequencing methods can be implemented in various ways.

Merge sorting, illustrated in FIG. 1B, is a known technique for producing a single sorted list from multiple ordered lists whose values are known a priori. For example, two lists 920 and 922 of known elements sorted in ascending order can be combined into a single sorted list 924 by repeatedly taking the smaller value from the front of lists 920 and 922, and appending the smaller value to the end of list 924. This example can be extended to a set of n known values, which can be sorted by first dividing the set into n lists containing a single value each, then combining pairs of lists to produce n/2 lists with two values each. Pairs of these lists are then merged, producing n/4 lists with four values each. Continuing in this fashion eventually yields a single sorted list containing the original values, but in sorted order, as shown in FIG. 1B. Merge sorting can also be implemented using three-way merging (that is, merging three sorted lists into a single sorted list in one step), rather than by using two-way merging. More generally, d-way merging can be used for any integer d>1.

However, known sorting techniques are not candidates for resequencing packets in a packet switching environment. In such a dynamic environment, lists of known values are not available. Rather, these packet switching systems must resequence streams of packets. Moreover, in dynamic resequencing systems, a complete set of packets to be rese-
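
The conventional time-based resequencing method described in the background above can be illustrated with a minimal sketch. It is not taken from the specification: the class name, method names, and the use of a binary heap as the time-ordered queue are assumptions made only for illustration. It shows how a packet delayed by more than the age threshold is delivered out of order.

```python
import heapq


class AgeThresholdResequencer:
    """Time-ordered output queue with an age threshold (hypothetical names)."""

    def __init__(self, age_threshold):
        self.age_threshold = age_threshold
        self._heap = []   # entries are (timestamp, arrival_seq, payload)
        self._seq = 0     # tie-breaker so payloads are never compared

    def insert(self, timestamp, payload):
        """Insert a packet that has emerged from the interconnection network."""
        heapq.heappush(self._heap, (timestamp, self._seq, payload))
        self._seq += 1

    def release(self, now):
        """Return, in timestamp order, the packets old enough to leave.

        The first packet is held back (with all those behind it) while
        now - timestamp is smaller than the age threshold.
        """
        out = []
        while self._heap and now - self._heap[0][0] >= self.age_threshold:
            timestamp, _, payload = heapq.heappop(self._heap)
            out.append((timestamp, payload))
        return out


# Packet B (timestamp 2) is delayed by 6 time units, more than the
# threshold of 3, so it leaves after packet C (timestamp 3).
q = AgeThresholdResequencer(age_threshold=3)
q.insert(1, "A")
first = q.release(now=2)     # A is too young, nothing leaves
q.insert(3, "C")
second = q.release(now=4)    # A leaves, C is held back
third = q.release(now=6)     # C leaves
q.insert(2, "B")
fourth = q.release(now=8)    # B leaves late: a resequencing error
```

Raising the threshold to 6 or more in this example would hold C back until B arrives, at the cost of a longer resequencing delay for every packet, which is the trade-off noted above.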
126, 127 can be combined into a single sorted stream of data packets in Stage 1 FIFO buffer 134 by repeatedly taking the packet with the smallest timestamp available from the buffers 126 and 127 and inserting it into the buffer 134. Both packets shown in Stage 1 FIFO buffer 134 in FIG. 2B came from source 127 in FIG. 2A, because both of these packets had smaller timestamps than any from source 126. This is not an uncommon occurrence.

Proceeding in this fashion, an arbitrary set of packets distributed across n source buffers can be sorted by selecting the packet with the smaller timestamp in each pair of source buffers 126, 127; 128, 129; 130, 131; 132, 133, as shown in FIG. 2A, and forwarding that packet to one of n/2 Stage 1 FIFO buffers 134, 135, 136, 137 as shown in FIG. 2B. The process is repeated in n/4 Stage 2 FIFO buffers 120, 122 as shown in FIG. 2C and completed in a FIFO buffer 124 as shown in FIG. 2D, where the first n packets have been stored in sorted order. Note that packets with later timestamps continue to flow into the system and are sorted in a similar fashion in their turn, as FIGS. 2A-D illustrate. FIGS. 2A-2D show a very simplified embodiment of the invention with only a single output and with no indication of the timing or control of the merging streams of timestamps. Other embodiments of the invention provide for multiple outputs and/or accommodating the timing or control of the merging streams of timestamps.

Certain embodiments of the invention further sense empty lists by means of a status message. In one such embodiment, a status message is an inserted packet having a timestamp value, as hereafter explained. A controller uses the timestamp value of the inserted packet to assure that all members of a set of source data packets destined for the same destination buffer have been processed. Embodiments of the invention also simultaneously sort and merge, within a switching element, a multiplicity of sorted streams of packets flowing on an input line into a multiplicity of sorted streams of packets flowing on an output line. The locally-performed sorting is further distributed among a plurality of switching elements forming a multistage interconnection network, so that packets received from a plurality of network sources are properly resequenced at each of a plurality of network destinations at the other side of the multistage interconnection network.

1. A Multistage Network with Distributed Resequencing

FIG. 3 illustrates an exemplary multistage network that implements distributed resequencing in accordance with the invention. A set of source buffers 50, 52, 54, 56 is followed by a three-stage network made up of four-port (d=4) switch elements 60, 62, 64, 66; 70, 72, 74, 76; and 80, 82, 84, 86. Each switch element has a two-slot arrival buffer 90 at each of its input ports and a four-slot departure buffer 92 at each of its output ports. Each buffer slot 96, 98, 100, 102, 104, 106 can contain a single fixed length packet, or cell. Small buffers and fixed length packets (known as cells) are used here for illustration purposes. Some embodiments typically have more slots per buffer and may provide for variable length packets. In FIG. 3, the presence of a cell in a buffer slot, for instance, slot 98, is indicated by a pair of numbers. The top number in each pair is a value representing an internal address, namely the designation number of the network destination to which the cell is to be delivered. The bottom number in each pair in a slot is a value representing the cell's timestamp, which is the time when the cell passed from the source buffer 50, 52, 54, or 56 (where it originated) to the first stage switch element 60, 62, 64, or 66. If a buffer contains no cells, a status message packet substitute, which is simply a timestamp value without an output designation number will be generated and hereinafter called a floor indication. Floor indications are shown in FIG. 3 as a pair of dashes, as in slots of elements 52, 60, 62, 66, 70, 72, 76, 80 (slot 110), 84, and 86, and a number, indicating the timestamp value. The presence of a floor indication with value j in a buffer means that any cells arriving in the future are guaranteed to have a timestamp at least as large as j.

FIG. 7 is a flowchart illustrating the steps performed in an embodiment of the invention by each switch element during a cell time, defined as the time required for one cell to be transferred from the output of a switch element to a downstream switch element. FIG. 4 shows the state of the network of FIG. 3 one cell time later. At the start of a cell time, each switch element examines its arrival buffers (Step A). For each arrival buffer that is not full, the switch element transmits a grant signal to the appropriate upstream neighbor element, indicating that it is safe for the upstream neighbor element to transmit a cell (Step B). No cells are transmitted between elements unless a grant signal is received. Other flow control techniques may be used.

Every departure buffer in the upstream neighbor element that has received a grant signal and has at least one cell sends its first cell. Every departure buffer in the upstream neighbor element that has received a grant signal but has no cell sends a floor indication. The timestamp value of the floor indication is the minimum of the timestamps of cells in the upstream neighbor element arrival buffers. For example, [...] shows that in switch element 70 the minimum timestamp in arrival buffers 200, 201, 202, 203 is 5; thus, a floor indication with timestamp 5 has been placed in empty departure buffers 205, 206, 207. Buffer 208 is the only departure buffer containing a cell that can be propagated.

Source buffers 50, 52, 54, 56 send cells to first stage switching elements 60, 62, 64, 66 as long as there is room in the arrival buffers of these switching elements. If a source buffer is empty, a floor indication is sent; in a practical embodiment the timestamp value of the floor indication is simply the current time at that input port.

These cells and floor indications are received by the switch element (Step C). For instance, the cell in slot 106 of departure buffer 92 in FIG. 3 has moved to slot 220 of arrival buffer 200 in FIG. 4. When a floor indication arrives at a non-full arrival buffer, it is stored at the end of the arrival buffer. If the arrival buffer already contains a floor indication as the last element, then the newly arrived floor indication replaces the old one. If a cell arrives at an arrival buffer containing a floor indication as the last element, the cell replaces the floor indication.

The switch element then performs a processing cycle (Steps D, E, F, G, H). The switch element determines the smallest timestamp value associated with any cell or floor indication in its arrival buffers (Step D). For example, in FIG. 3, switch element 82 determines that the smallest timestamp in its arrival buffers 160, 164, 166, 168 is "5."

The switch element next determines whether there is a cell having this smallest timestamp (Step E), for example, the cell in slot 162 of switch element 82. If there is a cell having the smallest timestamp, then the switch element moves the cell to an appropriate departure buffer (Step F); for example, the cell in slot 162 in FIG. 3 has moved to departure buffer 170 in FIG. 4. If this smallest timestamp is associated only with a floor indication, then processing is discontinued until the next cell time. For example, in switch element 80 in FIG. 3, the smallest timestamp in any arrival buffer is "2," which appears only in the floor indication in slot 110, and FIG. 4 shows that no cells have been moved
from any arrival buffer in switch element 80. The floor indication "2" appears in empty departure buffers 112, 114, 115 in FIG. 4.

If multiple arrival buffers share the smallest timestamp value, as is the case for example in switch elements 72 and 86 in FIG. 3, then each of these arrival buffers is considered in turn. If the arrival buffer contains a cell, then that cell is advanced to the appropriate departure buffer. If instead the arrival buffer contains a floor indication, nothing is moved from that buffer. For example, the packets in slots 190 and 192 in switch element 72 (FIG. 3) have moved to departure buffers 194 and 196, respectively (FIG. 4); in switch element 86, the floor indication in slot 191 in FIG. 3 has not moved in FIG. 4, but it has been overwritten by a new incoming cell. Moreover, the cell in slot 193 in FIG. 3 has moved to output buffer 195 in FIG. 4. Because the buffers in the switch element have limited capacity, no cell will be forwarded from an arrival buffer if the required departure buffer has no space available.

The selection of the appropriate departure buffer depends on the stage in the network at which the switch element is located. In a three-stage network, a typical objective in the first stage is to distribute the traffic as evenly as possible. Thus, a switch element in the first stage 60, 62, 64, 66 routes cells to a departure buffer containing the smallest number of cells, selecting a departure buffer randomly when no single buffer contains fewer cells than all others. In the second and third stages, a switch element 70, 72, 74, 76; 80, 82, 84, 86 routes each cell to the departure buffer for the outgoing link that lies on the path to the destination output port specified in the cell header.

Referring to FIG. 7, if moving a cell to a departure buffer leaves any of the switch element's arrival buffers empty during the processing cycle, the switch element places a floor indication in the empty arrival buffer (Step G). The timestamp of this floor indication is equal to the timestamp of the last cell that was taken from the buffer.

If time remains in the cell time (Step H), the switch element may perform additional processing cycles. In a practical embodiment a d-port switch element performs at least d processing cycles (Steps D, E, F, G, H) during a single cell time, unless processing is stopped due to a floor indication having the minimum timestamp.

Embodiments of the invention include networks that handle variable sized packets and networks with more than three stages. For example, in a five-stage Benes or Clos network, typically the first two stages perform the traffic distribution function, while the last three route packets based on their destination. In general, the first k-1 stages of a 2k-1 stage network perform traffic distribution, while the last k stages route packets based on their destination. This method can of course also be applied to networks in which the switch elements have different dimensions (numbers of inputs and outputs) and different buffer capacities and organizations.

The description above focuses on switch elements in which the arrival buffers and departure buffers are all of fixed size, and there is no sharing of memory space. However, in keeping with the scope and spirit of the invention, distributed resequencing can also be applied to switch elements in which some of the storage space is shared, rather than reserved for the exclusive use of particular inputs or outputs. It is required that each buffer have at least one dedicated storage slot, in order to prevent deadlock. The required modifications of the above procedures are straightforward.

To support multicast communication, wherein copies of a given packet are sent to multiple outputs (multipoint communication), the procedure for transferring packets from arrival buffers to departure buffers must be modified. In particular, a packet of minimum timestamp must be forwarded to all departure buffers on paths to its destinations. This need not involve storing multiple copies of the packet. It is sufficient to place a pointer to the stored packet in each of the appropriate departure buffers and store a reference count with the packet, so that the storage space can be released after the last required copy is sent. A switch element can use a variety of methods to decide to which local departure buffers to forward a multicast packet, including the use of explicit information stored in the packet header, and/or the use of a multicast routing table within the switch element.

So long as the different inputs assign timestamps to packets in increasing order, the collective operation of the system guarantees that the packets delivered at any output of the last stage switch elements will have non-decreasing timestamps.

The timestamps assigned by different network inputs do not have to be strictly synchronized. If one input is slightly ahead of another, this simply means that its packets will be held back a little extra time by the network, so that they can be delivered in timestamp order.

Because the timestamps must be represented using a finite number of bits, certain embodiments of the distributed resequencing methods and apparatus must allow for the possibility of timestamp values that wrap around due to the constraints of a finite representation. This problem can be avoided if the network is operated so that no packet can be delayed more than some maximum time T from the time it first arrives at a source buffer to the time it is delivered to its destination. So long as there is such a maximum delay, then if more than log2 T bits are used to represent the timestamps, a timestamp value can always be unambiguously interpreted by comparing it to the current time.

2. Systems with Multiple Priorities

In an embodiment of the invention, preferential treatment can be provided for some traffic relative to other traffic by assigning each packet passing through the system a priority and by handling packets with different priority values differently. Systems using distributed resequencing can accommodate multiple priority classes in accordance with the scope and spirit of the invention. In an embodiment system with m different priority classes, each buffer in FIG. 3 is replaced by a collection of m buffers, where each buffer contains packets of a different priority class. FIG. 8 shows an example four-port switch element 800 that supports three priority classes. Each source port 801-804 can supply packets to any one of three buffers; for instance, source port 801 can supply packets to arrival buffers 811, 812, 813. Likewise, each output port 805-808 can forward packets from one of three buffers; for instance, output port 805 can forward packets from departure buffers 851, 852, and 853. In FIG. 8, a packet is represented by a pair of numbers separated by a period (e.g., in buffer 812) indicating the destination and the timestamp in the format <destination>.<timestamp> in each field where there is a destination. Floor indications are shown here as underlined numbers. Each buffer is labeled P1, P2, or P3 to indicate priority, with P1 being highest priority.

Many different embodiments of systems handling multiple priority classes are possible in keeping with the scope and spirit of the invention. One such exemplary embodiment
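
The processing cycle of Steps D through H described above for a single switch element can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any embodiment: the names `Cell`, `Floor`, `processing_cycle`, `run_cell_time`, and the `route` callback are hypothetical, each arrival buffer is assumed to hold its entries in timestamp order so that only the head needs examining, and an arrival buffer holding neither a cell nor a floor indication is assumed to block processing.

```python
from collections import deque, namedtuple

Cell = namedtuple("Cell", "timestamp dest")
Floor = namedtuple("Floor", "timestamp")


def processing_cycle(arrival, departure, route, capacity):
    """Run one processing cycle and return the number of cells moved.

    arrival:   list of deques holding Cells and at most one trailing Floor
    departure: dict mapping an output port to a deque of Cells
    route:     hypothetical function mapping a Cell to an output port
    capacity:  number of slots in each departure buffer
    """
    # Assumption: with no cell or floor in some arrival buffer there is no
    # lower bound on future arrivals there, so this sketch simply waits.
    if any(not buf for buf in arrival):
        return 0
    # Step D: smallest timestamp over all arrival buffers (heads suffice).
    smallest = min(buf[0].timestamp for buf in arrival)
    moved = 0
    for buf in arrival:  # buffers sharing the smallest value, in turn
        head = buf[0]
        # Steps E and F: only a cell with the smallest timestamp advances;
        # a floor indication with that timestamp moves nothing.
        if head.timestamp != smallest or isinstance(head, Floor):
            continue
        out = departure[route(head)]
        if len(out) >= capacity:
            continue  # required departure buffer has no space available
        out.append(buf.popleft())
        moved += 1
        # Step G: an emptied arrival buffer receives a floor indication
        # carrying the timestamp of the last cell taken from it.
        if not buf:
            buf.append(Floor(head.timestamp))
    return moved


def run_cell_time(arrival, departure, route, capacity, max_cycles):
    """Step H: repeat cycles while time remains and cells can still move."""
    total = 0
    for _ in range(max_cycles):
        moved = processing_cycle(arrival, departure, route, capacity)
        if moved == 0:
            break
        total += moved
    return total


arrival = [
    deque([Cell(5, 1), Cell(7, 0)]),
    deque([Floor(6)]),
    deque([Cell(5, 0)]),
]
departure = {0: deque(), 1: deque()}
moved = run_cell_time(arrival, departure, lambda c: c.dest, capacity=4, max_cycles=3)
```

In this example both cells with timestamp 5 advance in the first cycle. The cell with timestamp 7 is then held, because the floor indications with values 5 and 6 mean that cells with smaller timestamps may still arrive on the other inputs.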
buffers as possible during a given cell time. To provide fair treatment of all network outputs, the switch element can use round robin scheduling when selecting an active output during each processing cycle.

From time to time, floor indications must be sent from each switch element to its downstream neighbor elements for all empty buffers. In a large system with per output buffers, the number of floor indications that must be sent can become prohibitive. However, as will be discussed below, in the context of systems with reduced per output buffers, this problem can be avoided by combining and transmitting floor indications in a bandwidth-efficient manner.

The number of separate buffers required in a switch element can become large. For instance, if the network of FIG. 3 were modified to provide per output buffers, each first stage switch element 60, 62, 64, 66 would require 128 distinct buffers (64 arrival buffers and 64 departure buffers) while each second stage switch element 70, 72, 74, 76 would require 80 buffers (64 arrival buffers and 16 output buffers) and each third stage switch element 80, 82, 84, 86 would require 20 buffers (16 arrival buffers and 4 output buffers). In general, a multistage interconnection network with d-port switch elements, 2k-1 stages and n=d^k requires 2dn distinct buffers in each of the first k-1 stages, dn+n in the center stage and n+n/d in the stage following the center stage. Switch elements in subsequent stages require 1/d times the number of buffers as switch elements in the immediately preceding stage; thus, the last stage switch elements require d^2+d buffers each. To reduce memory requirements, switch element buffers can be implemented using linked list techniques so that only the list headers need to be replicated for each network output. This makes it practical for a switch element to implement several thousand distinct buffers. To ensure that deadlock cannot occur, each buffer must have at least one packet storage slot dedicated to it.

A system with per output buffers can also support m priority classes by further increasing the number of buffers. In particular, to convert a single priority system with per output buffers to a system supporting m priority classes requires that each buffer be replaced with m buffers, one for each class. The single priority case is discussed here; the extension to multiple priorities is straightforward.

4. Systems with Reduced Per Output Buffers

The number of distinct buffers needed for per output queuing can become large enough to limit the size of systems that can be built. Thus, it is significant that the number of buffers needed in each switch element can be substantially reduced by restricting the use of per output buffers to the final stages of the network, where they are most beneficial. An embodiment with reduced per output buffers is illustrated in FIG. 6, which shows a three-stage network with two-port switch elements. In this network, the first stage switch elements 360, 362 have just a single buffer per input port 363, 364 and a single buffer per output port 365, 366, as in the system of FIG. 3. In these switch elements, each packet is represented by a decimal number, as in buffer 363: the destination appears before the decimal point, and the timestamp appears after it. Floor indications are represented as underlined whole numbers, as in buffer 367. The second stage switch elements 370, 372 have a single arrival buffer per input port 373, 374, but have separate departure buffers for each network output 375-378. The third stage switch elements have separate arrival buffers 383-386 and departure buffers 387, 388 for each reachable output. In buffers dedicated to a single network output, for instance, buffer 378, the destination of each packet has been suppressed. Note that the source buffers 350, 352, 354, 356 are still organized on a per output basis. In this system, the first stage switch elements operate like the first stage switch elements in the system of FIG. 3, while the third stage switch elements operate like the third stage switch elements in the [...]. The center stage switch elements provide [...]

Providing effective traffic isolation in the system of FIG. 6 requires the introduction of a per output flow control mechanism to prevent departure buffers in the center stage switch elements from becoming full. In particular, when the buffers in the center stage switch elements for a particular network output become nearly full, a flow control signal is sent to all of the source buffers 350, 352, 354, 356, telling them to suspend transmitting packets to the congested output. [...] transmission of packets to output port 1. More precisely, there is a threshold defined for the departure buffers [...] number of packets in a buffer exceeds the threshold, a flow [...] congested output. After the number of packets in the buffer drops below the threshold again, the network inputs are allowed to resume transmission of packets to the congested output. As long as flow control is exerted early enough to [...] maintain effective traffic isolation among the different outputs. This requires enough packet [...] that the buffer [...] rises above the threshold and the time the inflow of packets to the congested output stops.

To implement this reduced per output buffering in the system of FIG. 3, each first stage switch element requires 8 separate buffers, each second stage switch element requires 20 buffers and each third stage switch element requires 20 buffers. To avoid deadlock, each of these buffers must have at least one dedicated packet storage slot. The approach can be generalized to systems with more than three stages. In a system with d-port switch elements, 2k-1 stages, and n=d^k, the switch elements in the first k-1 stages of this network can use shared buffers, so each of these switch elements has 2d buffers. The switch elements from the middle stage onward can use per output buffers. In this case, each center stage switch element will have d+n buffers, each switch element in the next stage will have n+n/d buffers and switch elements in each subsequent stage will have 1/d times the number in the immediately preceding stage. The approach can also be generalized further to have per output buffering start later than the center stage. In this case, the per output flow control must be implemented at the stage where the transition from shared to per output buffers occurs.

With per output buffering, a switch element must send floor indications to its downstream neighbors for each of its departure buffers. The amount of information that must be exchanged because of this can become impracticably large. With reduced per output buffering, it is possible to greatly reduce the amount of floor information that must be sent by combining floor indications for different departure buffers in the form of floor vectors, which are generated by each switch element in the stage where the transition from shared buffers to per output buffers occurs, typically the center stage. In any one of these switch elements, for instance, in switch element 370 of FIG. 6, all empty departure buffers must have the same timestamp value for their floor indica-
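
The buffer counts given above for full and reduced per output buffering can be checked with a short sketch. The function names are hypothetical, and the code assumes a network of 2k-1 stages built from d-port switch elements with n=d^k, as in the general formulas stated in the description.

```python
def per_output_buffers(d, k):
    """Buffers per switch element at each stage, per output buffers throughout."""
    n = d ** k
    counts = [2 * d * n] * (k - 1)   # each of the first k-1 stages
    counts.append(d * n + n)         # center stage
    nxt = n + n // d                 # stage following the center stage
    for _ in range(k - 1):
        counts.append(nxt)
        nxt //= d                    # each later stage needs 1/d as many
    return counts


def reduced_per_output_buffers(d, k):
    """Buffers per switch element when the first k-1 stages use shared buffers."""
    n = d ** k
    counts = [2 * d] * (k - 1)       # one buffer per input and per output port
    counts.append(d + n)             # center stage
    nxt = n + n // d
    for _ in range(k - 1):
        counts.append(nxt)
        nxt //= d
    return counts


# The three-stage network of FIG. 3 has d=4 and k=2, so n=16.
full = per_output_buffers(4, 2)             # [128, 80, 20]
reduced = reduced_per_output_buffers(4, 2)  # [8, 20, 20]
```

For d=4 and k=2 the two lists reproduce the figures quoted in the text: 128, 80, and 20 buffers with per output buffers in every stage, against 8, 20, and 20 with reduced per output buffering.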
tamp value, always discontinuing forwarding of said
one or more data packets for the remaining duration of 
a current cell time; and 
in response to identifying that a particular data packet of 
said data packets has associated therewith the earliest 
timestamp value, forwarding the particular data packet
during the current cell time; wherein said forwarding 
the particular data packet during the current cell time 
includes removing the particular data packet from an 
arrival buffer and if said removing causes the arrival 
buffer to become empty, in response adding a new floor 
indication to the arrival buffer. 

2. The method of claim 1, wherein time remains in the 
current cell time to forward at least one of said one or more 
data packets when said discontinuing forwarding of said one 
or more data packets during the current cell time is per- 
formed. 

3. An apparatus for distributed resequencing of packets in 
a packet switching system, the method comprising: 

means for identifying one or more floor indications 
received by a switching element, each of said one or 
more floor indications associated with a respective 
timestamp value;

means for identifying one or more data packets received 
by the switching element, each of said one or more data 
packets associated with a respective timestamp value; 

means for finding an earliest timestamp value associated 
with said one or more floor indications and said one or 
more data packets; 

means for always discontinuing forwarding of said one or 
more data packets for the remaining duration a current 
cell time in response to making a determination that not 
one of said data packets has associated therewith the 
earliest timestamp value; and 

means for forwarding the particular data packet during the 
current cell time in response to identifying that a 
particular data packet of said data packets has associ- 
ated therewith the earliest timestamp value; wherein 
said means for forwarding the particular data packet 
during the current cell time includes means for remov- 
ing the particular data packet from an arrival buffer and 
if said removing causes the arrival buffer to become 
empty, in response adding a new floor indication to the 
arrival buffer. 

4. The apparatus of claim 3, wherein time remains in the 
current cell time to forward at least one of said one or more 
data packets when said discontinuing forwarding of said one 
or more data packets during the current cell time is per- 
formed. 

5. A method for distributed resequencing of packets in a 
packet switching system, the method comprising: 
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identifying one or more floor indications received by a 
switching element, each of said one or more floor 
indications associated with a respective timestamp 
value; 

identifying one or more data packets received by the 
switching element, each of said one or more data 
packets associated with a respective timestamp value; 

finding an earliest timestamp value associated with said 
one or more floor indications and said one or more data 
packets; 

in response to making a determination that not one of said 
data packets has associated therewith the earliest times- 
tamp value, always discontinuing forwarding of said 
one or more data packets for the remaining duration of 
a current cell time; and 

in response to identifying that a particular data packet of 
said data packets has associated therewith the earliest 
timestamp value, forwarding the particular data packet 
during the current cell time. 

6. The method of claim 5, wherein time remains in the 
current cell time to forward at least one of said one or more 
data packets when said discontinuing forwarding of said one 
or more data packets during the current cell time is per- 
formed. 

7. An apparatus for distributed resequencing of packets in 
a packet switching system, the method comprising: 

means for identifying one or more floor indications 
received by a switching element, each of said one or 
more floor indications associated with a respective 
timestamp value; 
means for identifying one or more data packets received 
by the switching element, each of said one or more data 
packets associated with a respective timestamp value;

means for finding an earliest timestamp value associated
with said one or more floor indications and said one or
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means for always discontinuing forwarding of said one or 
more data packets for the remaining duration of a 
current cell time in response to making a determination 
that not one of said data packets has associated there- 
with the earliest timestamp value; and 
means for forwarding the particular data packet during the 
current cell time in response to identifying that a 
particular data packet of said data packets has associ- 
ated therewith the earliest timestamp value. 
8. The apparatus of claim 7, wherein time remains in the 
current cell time to forward at least one of said one or more 
data packets when said discontinuing forwarding of said one 
or more data packets during the current cell time is per- 
formed. 



COMMUNICATIONS INTERCONNECTION NETWORK WITH 
DISTRIBUTED RESEQUENCING 

STATEMENT AS TO RIGHTS TO INVENTIONS MADE UNDER
UNIVERSITY SPONSORED RESEARCH AND DEVELOPMENT

The invention was performed in the course of work under sponsorship
of Washington University of St. Louis, Missouri, and Growth Networks Inc., of
Mountain View, California. Each entity claims joint ownership in the invention.
Growth Networks Inc. has exclusive rights under license from Washington
University.

FIELD OF THE INVENTION 

This invention relates to communications information routing; and
more specifically to switch interconnection networks or switch fabrics.

BACKGROUND OF THE INVENTION 

Buffered multistage interconnection networks are often used in
Asynchronous Transfer Mode ("ATM") and other types of packet switching systems.
Networks of this type use buffers to store packets at intermediate points when
contention for output links prevents immediate transmission. As used herein, the term
packet is used to indicate generically addressable data packets of all types, including
fixed length cells and variable length packets.

Many multistage interconnection networks provide multiple paths
between network inputs and outputs, allowing the traffic to be balanced across the
alternative paths. An example of such a network is shown in Figure 1A, which is a
depiction of an architecture known as a three stage Beneš network. The Beneš
network 10 is composed of three stages of switch elements (SE) 12, 14, 16 and two
webs of interconnecting links (IL) 18, 20. Switch elements 12, 14, 16 may have any
number of input type interconnecting links 24 and output type interconnecting links
26. Letting d denote the number of input links and the number of output links in a
single switch element and letting n denote the number of input links and output links
of the multistage network as a whole, Figure 1A illustrates d=4 and n=16. Traffic

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IN THE SPECIFICATION: 

Please replace the following paragraph beginning on page 4, line 18, with the 
following paragraph which is marked-up to indicate the changes: 

Merge sorting, illustrated in Fig. 1B, is a known technique for producing a
single sorted list from multiple ordered lists whose values are known a priori. For example,
two lists 920 and 922 of known elements sorted in ascending order can be combined into a
single sorted list 924 by repeatedly taking the smaller value from the front of lists 920 and
922, and appending the smaller value to the end of list 924. This example can be extended to a
set of n known values, which can be sorted by first dividing the set into n lists containing a
single value each, then combining pairs of lists to produce n/2 lists with two values each. Pairs
of these lists are then merged, producing n/4 lists with four values each. Continuing in this
fashion eventually yields a single sorted list containing the original values, but in sorted order,
as shown in Figure 1B. Merge sorting can also be implemented using three-way merging (that
is, merging three sorted lists into a single sorted list in one step), rather than by using two-way
merging. More generally, d-way merging can be used for any integer d>1.
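
The two-way merge sorting described in this paragraph can be illustrated with a short Python sketch. The function names merge and merge_sort are chosen here for illustration only and do not appear in the specification; the handling of an odd number of lists (carrying the unpaired list to the next round) is likewise an assumption, since the paragraph only describes the case where the lists pair up evenly.

```python
def merge(a, b):
    """Merge two ascending lists by repeatedly taking the smaller front value."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # One list is exhausted; the remainder of the other is already sorted.
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def merge_sort(values):
    """Start from n single-value lists and merge pairs until one list remains."""
    lists = [[v] for v in values]
    while len(lists) > 1:
        merged = [merge(lists[k], lists[k + 1])
                  for k in range(0, len(lists) - 1, 2)]
        if len(lists) % 2:  # assumption: an unpaired list waits for the next round
            merged.append(lists[-1])
        lists = merged
    return lists[0] if lists else []
```

With eight input values, the rounds produce four lists of two values, then two lists of four values, then the single sorted list, matching the n/2 and n/4 progression described above.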
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the invention with only a single output and with no indication of the timing or control 
of the merging streams of timestamps. Other embodiments of the invention provide 
for multiple outputs and/or accommodating the timing or control of the merging 
streams of timestamps. 

Certain embodiments of the invention further sense empty lists by
means of a status message. In one such embodiment, a status message is an inserted
packet having a timestamp value, as hereafter explained. A controller uses the
timestamp value of the inserted packet to assure that all members of a set of source
data packets destined for the same destination buffer have been processed.
Embodiments of the invention also simultaneously sort and merge, within a switching
element, a multiplicity of sorted streams of packets flowing on an input line into a
multiplicity of sorted streams of packets flowing on an output line. The
locally-performed sorting is further distributed among a plurality of switching elements
forming a multistage interconnection network, so that packets received from a
plurality of network sources are properly resequenced at each of a plurality of
network destinations at the other side of the multistage interconnection network.

1. A Multistage Network with Distributed Resequencing 
Figure 3 illustrates an exemplary multistage network that implements
distributed resequencing in accordance with the invention. A set of source buffers 50,
52, 54, 56 is followed by a three-stage network made up of four-port (d = 4) switch
elements 60, 62, 64, 66; 70, 72, 74, 76; and 80, 82, 84, 86. Each switch element has a
two-slot arrival buffer 90 at each of its input ports and a four-slot departure buffer 92
at each of its output ports. Each buffer slot 96, 98, 100, 102, 104, 106 can contain a
single fixed length packet, or cell. Small buffers and fixed length packets (known as
cells) are used here for illustration purposes; some embodiments typically have more
slots per buffer and may provide for variable length packets. In Fig. 3, the presence of
a cell in a buffer slot, for instance, slot 98, is indicated by a pair of numbers. The top
number in each pair is a value representing an internal address, namely the
designation number of the network destination to which the cell is to be delivered.
The bottom number in each pair in a slot is a value representing the cell's timestamp,
which is the time when the cell passed from the source buffer 50, 52, 54, or 56 (where
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indication in slot 191 in Fig. 3 has not moved in Fig. 4, but it has been overwritten by 
a new incoming cell. Moreover, the cell in slot 193 in Fig. 3 has moved to output
buffer 195 in Fig. 4. Because the buffers in the switch element have limited capacity,
no cell will be forwarded from an arrival buffer if the required departure buffer has no
space available.

The selection of the appropriate departure buffer depends on the stage
in the network at which the switch element is located. In a three-stage network, a
typical objective in the first stage is to distribute the traffic as evenly as possible.
Thus, a switch element in the first stage 60, 62, 64, 66 routes cells to a departure
buffer containing the smallest number of cells, selecting a departure buffer randomly
when no single buffer contains fewer cells than all others. In the second and third
stages, a switch element 70, 72, 74, 76; 80, 82, 84, 86 routes each cell to the departure
buffer for the outgoing link that lies on the path to the destination output port
specified in the cell header.
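
The first-stage selection rule described above (least-occupied departure buffer, with a random choice among ties) can be sketched in Python as follows. The function name pick_departure_buffer and the representation of buffer occupancy as a list of cell counts are assumptions made for illustration; the destination-based routing of the second and third stages depends on the network wiring and is not modeled here.

```python
import random


def pick_departure_buffer(occupancy, rng=random):
    """Return the index of a departure buffer holding the fewest cells.

    occupancy[i] is the number of cells currently in departure buffer i
    (a hypothetical representation). When several buffers tie for the
    smallest count, one of them is chosen at random.
    """
    fewest = min(occupancy)
    candidates = [i for i, count in enumerate(occupancy) if count == fewest]
    return rng.choice(candidates)
```

For example, with occupancy [3, 1, 2, 4] the function returns index 1, while with [1, 3, 1, 2] it returns either 0 or 2.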

Referring to Fig. 7, if moving a cell to a departure buffer leaves any of
the switch element's arrival buffers empty during the processing cycle, the switch
element places a floor indication in the empty arrival buffer (Step G). The timestamp
of this floor indication is equal to the timestamp of the last cell that was taken from
the buffer.

If time remains in the cell time (Step H), the switch element may
perform additional processing cycles. In a practical embodiment a d-port switch
element performs at least d processing cycles (Steps D, E, F, G, H) during a single
cell time, unless processing is stopped due to a floor indication having the minimum
timestamp.
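
The processing cycles of one cell time can be sketched in Python as shown below. This is a simplified model under stated assumptions, not the embodiment of Fig. 7: cells are (timestamp, destination) tuples, each arrival buffer is a deque, an empty arrival buffer is represented by a floor timestamp held in a parallel list, and the names run_cell_time, floors, route, and capacity are hypothetical. The sketch also simply ends the cell time when the required departure buffer is full, which is a simplification.

```python
from collections import deque


def run_cell_time(arrivals, floors, departures, route, capacity, cycles):
    """Run up to `cycles` processing cycles; return the number of cells forwarded.

    arrivals   -- list of deques of (timestamp, destination) cells, one per input
    floors     -- floors[p] is the floor timestamp used while arrivals[p] is empty
    departures -- list of deques, one per output
    route      -- function mapping a destination to a departure buffer index
    """
    forwarded = 0
    for _ in range(cycles):
        # Head of each arrival buffer as (timestamp, is_floor, port).
        heads = [(buf[0][0], False, p) if buf else (floors[p], True, p)
                 for p, buf in enumerate(arrivals)]
        # False sorts before True, so a cell wins a timestamp tie with a floor.
        ts, is_floor, port = min(heads)
        if is_floor:
            break  # an earlier cell may still arrive on that input: stop
        out = departures[route(arrivals[port][0][1])]
        if len(out) >= capacity:
            break  # simplification: no space in the required departure buffer
        out.append(arrivals[port].popleft())
        forwarded += 1
        if not arrivals[port]:
            floors[port] = ts  # floor takes the timestamp of the last cell taken
    return forwarded
```

In this model, a floor indication with timestamp 4 on an empty input holds back a waiting cell with timestamp 5, because a cell stamped between 4 and 5 could still arrive on that input.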

Embodiments of the invention include networks that handle variable
sized packets and networks with more than three stages. For example, in a five-stage
Beneš or Clos network, typically the first two stages perform the traffic distribution
function, while the last three route packets based on their destination. In general, the
first k-1 stages of a 2k-1 stage network perform traffic distribution, while the last k
stages route packets based on their destination. This method can of course also be
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allow for the possibility of timestamp values that wrap around due to the constraints 
of a finite representation. This problem can be avoided if the network is operated so 
that no packet can be delayed more than some maximum time J from the time it first 
arrives at a source buffer to the time it is delivered to its destination. So long as there 
is such a maximum delay, then if more than log2 J bits are used to represent the
timestamps, a time value can always be unambiguously interpreted by comparing it to
the current time. 
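
One way to interpret wrapped timestamps relative to the current time is sketched below in Python. The sketch assumes that timestamps and the current time are both stored modulo 2**bits and that no timestamp in the network is more than 2**bits - 1 time units old; the function names age and earlier are hypothetical and are not taken from the specification.

```python
def age(timestamp, now, bits):
    """Time elapsed since `timestamp`, with both values stored modulo 2**bits."""
    return (now - timestamp) % (1 << bits)


def earlier(ts_a, ts_b, now, bits):
    """True if ts_a is strictly earlier than ts_b; an older value has a larger age."""
    return age(ts_a, now, bits) > age(ts_b, now, bits)
```

With a 4-bit counter and a current time of 2 (just after wrapping), a timestamp of 14 has age 4 and a timestamp of 1 has age 1, so 14 is correctly treated as the earlier value even though it is numerically larger.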

2. Systems with Multiple Priorities 

In an embodiment of the invention, preferential treatment can be
provided for some traffic relative to other traffic by assigning each packet passing
through the system a priority and by handling packets with different priority values
differently. Systems using distributed resequencing can accommodate multiple
priority classes in accordance with the scope and spirit of the invention. In an
embodiment system with m different priority classes, each buffer in Figure 3 is
replaced by a collection of m buffers, where each buffer contains packets of a
different priority class. Figure 8 shows an example four-port switch element 800 that
supports three priority classes. Each source port 801-804 can supply packets to any
one of three buffers; for instance, source port 801 can supply packets to arrival buffers
811, 812, 813. Likewise, each output port 805-808 can forward packets from one of
three buffers; for instance, output port 805 can forward packets from departure buffers
851, 852, and 853. In Figure 8, a packet is represented by a pair of numbers separated
by a period (e.g., in buffer 812) indicating the destination and the timestamp in the
format <destination>.<timestamp> in each field where there is a destination. Floor
indications are shown here as underlined numbers. Each buffer is labeled P1, P2, or
P3 to indicate priority, with P1 being highest priority.

Many different embodiments of systems handling multiple priority
classes are possible in keeping with the scope and spirit of the invention. One such
exemplary embodiment system with multiple priority classes differs in the following
respects from a typical embodiment system with a single priority class.

If an output port of a given switch element has packets waiting in
several of its buffers, one of the buffers is selected and the first packet in the selected
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From time to time, floor indications must be sent from each switch 
element to its downstream neighbor elements for all empty buffers. In a large system 
with per output buffers, the number of floor indications that must be sent can become 
prohibitive. However, as will be discussed below, in the context of systems with
reduced per output buffers, this problem can be avoided by combining and
transmitting floor indications in a bandwidth-efficient manner.

The number of separate buffers required in a switch element can
become large. For instance, if the network of Figure 3 were modified to provide per
output buffers, each first stage switch element 60, 62, 64, 66 would require 128
distinct buffers (64 arrival buffers and 64 departure buffers), while each second stage
switch element 70, 72, 74, 76 would require 80 buffers (64 arrival buffers and 16
output buffers) and each third stage switch element 80, 82, 84, 86 would require 20
buffers (16 arrival buffers and 4 output buffers). In general, a multistage
interconnection network with d-port switch elements, 2k-1 stages, and n=d^k requires
2dn distinct buffers in each of the first k-1 stages, dn+n in the center stage, and n+n/d
in the stage following the center stage. Switch elements in subsequent stages require
1/d times the number of buffers as switch elements in the immediately preceding
stage; thus, the last stage switch elements require d buffers each. To reduce
memory requirements, switch element buffers can be implemented using linked list
techniques so that only the list headers need to be replicated for each network output.
This makes it practical for a switch element to implement several thousand distinct
buffers. To ensure that deadlock cannot occur, each buffer must have at least one
packet storage slot dedicated to it.
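
The per-stage buffer counts can be checked with a short Python sketch. It assumes the general expressions read as 2dn buffers per switch element in each of the first k-1 stages, dn+n in the center stage, and n+n/d in the stage after the center, with each later stage needing 1/d of the preceding one; this reading is consistent with the Figure 3 example figures of 128, 80, and 20 for d=4 and n=16. The function name buffers_per_switch_element is hypothetical.

```python
def buffers_per_switch_element(d, k):
    """Per-stage buffer counts for d-port elements, 2k-1 stages, n = d**k outputs."""
    n = d ** k
    counts = []
    for stage in range(1, 2 * k):
        if stage < k:
            counts.append(2 * d * n)   # d*n arrival plus d*n departure buffers
        elif stage == k:
            counts.append(d * n + n)   # center stage
        else:
            # Stage k+1 needs n + n/d; each later stage needs 1/d of the previous.
            counts.append((n + n // d) // d ** (stage - k - 1))
    return counts
```

Calling buffers_per_switch_element(4, 2) reproduces the three-stage example: [128, 80, 20].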

A system with per output buffers can also support m priority classes by
further increasing the number of buffers. In particular, to convert a single priority
system with per output buffers to a system supporting m priority classes requires that
each buffer be replaced with m buffers, one for each class. The single priority case is
discussed; the extension to multiple priorities is straightforward.

4. Systems with Reduced Per Output Buffers

The number of distinct buffers needed for per output queuing can
become large enough to limit the size of systems that can be built. Thus, it is
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36 (currently amended): An apparatus for distributed resequencing of packets in a 
packet switching system, the method comprising: 

means for identifying one or more floor indications received by a switching element, 
each of said one or more floor indications associated with a respective timestamp value; 

means for identifying one or more data packets received by the switching element, 
each of said one or more data packets associated with a respective timestamp value; 

means for finding an earliest timestamp value associated with said one or more floor
indications and said one or more data packets;

means for always discontinuing forwarding of said one or more data packets for
the remaining duration of a current cell time in response to making a determination
that not one of said data packets has associated therewith the earliest timestamp value; and

means for forwarding the particular data packet during the current cell time in 
response to identifying that a particular data packet of said data packets has associated 
therewith the earliest timestamp value. 

37 (previously presented): The apparatus of claim 36, wherein time remains in the 
current cell time to forward at least one of said one or more data packets when said 
discontinuing forwarding of said one or more data packets during the current cell time is 
performed. 
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